Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning

Lee, Kevin, Spiewak, Russell, Walsh, James

arXiv.org Artificial Intelligence

Scientific reasoning through Large Language Models in heliophysics involves more than just recalling facts: it requires incorporating physical assumptions, maintaining consistent units, and providing clear scientific formats through coordinated approaches. To address these challenges, we present Reasoning With a Star, a newly contributed heliophysics dataset applicable to reasoning; we also provide an initial benchmarking approach. Our data are constructed from National Aeronautics and Space Administration & University Corporation for Atmospheric Research Living With a Star summer school problem sets and compiled into a readily consumable question-and-answer structure with question contexts, reasoning steps, expected answer type, ground-truth targets, format hints, and metadata. A programmatic grader checks the predictions using unit-aware numerical tolerance, symbolic equivalence, and schema validation. We benchmark a single-shot baseline and four multi-agent patterns, finding that decomposing workflows through systems engineering principles outperforms direct prompting on problems requiring deductive reasoning rather than pure inductive recall.
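As a rough illustration of what a unit-aware numerical tolerance check might look like, here is a minimal Python sketch. It is not the authors' grader: the conversion table `UNIT_TO_BASE`, the function names, and the 1% default relative tolerance are all assumptions made for this example.

```python
import math

# Hypothetical conversion table (an assumption for this sketch):
# unit symbol -> (base unit, multiplicative factor to reach the base unit).
UNIT_TO_BASE = {
    "m": ("m", 1.0),
    "km": ("m", 1.0e3),
    "s": ("s", 1.0),
    "min": ("s", 60.0),
    "h": ("s", 3600.0),
}


def to_base(value, unit):
    """Convert a value to its base unit; raises KeyError for unknown units."""
    base, factor = UNIT_TO_BASE[unit]
    return value * factor, base


def grade_numeric(pred, target, rel_tol=0.01):
    """Return True if pred matches target within a relative tolerance.

    Both arguments are (value, unit) tuples. Unknown units or mismatched
    base units are graded as incorrect rather than raising.
    """
    try:
        p_val, p_base = to_base(*pred)
        t_val, t_base = to_base(*target)
    except KeyError:
        return False
    if p_base != t_base:
        return False
    return math.isclose(p_val, t_val, rel_tol=rel_tol)
```

With these assumptions, `grade_numeric((1.5, "km"), (1500.0, "m"))` accepts the prediction because both values convert to the same number of metres, while a prediction in seconds against a target in metres is rejected regardless of its numeric value.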


6d0f9c415e2d779c78f32b74668e9d02-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

Fact-checking is extensively studied in the context of misinformation and disinformation, addressing objective inaccuracies. However, a softer form of misinformation involves responses that are factually correct but lack certain features such as clarity and relevance. This challenge is prevalent in formal Question-Answer (QA) settings such as press conferences in finance, politics, sports, and other domains, where subjective answers can obscure transparency. Despite this, there is a lack of manually annotated datasets for subjective features across multiple dimensions. To address this gap, we introduce SubjECTive-QA, a human-annotated dataset on Earnings Call Transcripts' (ECTs) QA sessions, as the answers given by company representatives are often open to subjective interpretations and scrutiny. The dataset includes 49,446 annotations for long-form QA pairs across six features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant. These features are carefully selected to encompass the key attributes that reflect the tone of the answers provided during QA sessions across different domains. Our findings are that the best-performing Pre-trained Language Model (PLM), RoBERTa-base, has similar weighted F1 scores to Llama-3-70b-Chat on features with lower subjectivity, such as Relevant and Clear, with a mean difference of 2.17% in their weighted F1 scores.


An Evaluation Framework for Network IDS/IPS Datasets: Leveraging MITRE ATT&CK and Industry Relevance Metrics

Tori, Adrita Rahman, Hasan, Khondokar Fida

arXiv.org Artificial Intelligence

The performance of Machine Learning (ML) and Deep Learning (DL)-based Intrusion Detection and Prevention Systems (IDS/IPS) is critically dependent on the relevance and quality of the datasets used for training and evaluation. However, current AI model evaluation practices for developing IDS/IPS focus predominantly on accuracy metrics, often overlooking whether datasets represent industry-specific threats. To address this gap, we introduce a novel multi-dimensional framework that integrates the MITRE ATT&CK knowledge base for threat intelligence and employs five complementary metrics that together provide a comprehensive assessment of dataset suitability. Methodologically, this framework combines threat intelligence, natural language processing, and quantitative analysis to assess the suitability of datasets for specific industry contexts. Applying this framework to nine publicly available IDS/IPS datasets reveals significant gaps in threat coverage, particularly in the healthcare, energy, and financial sectors. In particular, recent datasets (e.g., CIC-IoMT, CIC-UNSW-NB15) align better with sector-specific threats, whereas others, like CICIoV-24, underperform despite their recency. Our findings provide a standardized, interpretable approach for selecting datasets aligned with sector-specific operational requirements, ultimately enhancing the real-world effectiveness of AI-driven IDS/IPS deployments. The efficiency and practicality of the framework are validated through deployment in a real-world case study, underscoring its capacity to inform dataset selection and enhance the effectiveness of AI-driven IDS/IPS in operational environments.


Comparative Analysis of FOLD-SE vs. FOLD-R++ in Binary Classification and XGBoost in Multi-Category Classification

Murthy, Akshay, Sebastian, Shawn, Shangle, Manil, Wang, Huaduo, Dasgupta, Sopam, Gupta, Gopal

arXiv.org Artificial Intelligence

Recently, the demand for Machine Learning (ML) models that can balance accuracy, efficiency, and interpretability has grown significantly. Traditionally, there has been a tradeoff between accuracy and explainability in predictive models, with models such as Neural Networks achieving high accuracy on complex datasets while sacrificing internal transparency. As such, new rule-based algorithms such as FOLD-SE have been developed that provide tangible justification for predictions in the form of interpretable rule sets. The primary objective of this study was to compare FOLD-SE and FOLD-R++, both rule-based classifiers, in binary classification and to evaluate how FOLD-SE performs against XGBoost, a widely used ensemble classifier, in multi-category classification. We hypothesized that because FOLD-SE generates a condensed rule set in a more explainable manner, it would lose upwards of 3 percent on average in accuracy and F1 score when compared with XGBoost and FOLD-R++ in multiclass and binary classification, respectively. The research used data collections for classification, with accuracy, F1 score, and processing time as the primary performance measures. Outcomes show that in binary classification FOLD-SE is superior to FOLD-R++ because it produces fewer rules, at the cost of a minor loss in accuracy and processing-time efficiency; in multi-category classification, FOLD-SE is more precise and far more efficient than XGBoost, in addition to generating a comprehensible rule set. The results indicate that FOLD-SE is a better choice for both binary and multi-category classification tasks. Therefore, these results demonstrate that rule-based approaches like FOLD-SE can bridge the gap between explainability and performance, highlighting their potential as viable alternatives to black-box models in diverse classification tasks.


DoorDash plans to test drone deliveries in San Francisco warehouse

Los Angeles Times

Masslie Arias, of DoorDash, prepares to load a delivery package on a hovering drone on July 31 in Frisco, Texas. Food delivery app DoorDash is setting its sights on a new destination to test out flying drone deliveries: San Francisco.


Fast Estimation of Globally Optimal Independent Contact Regions for Robust Grasping and Manipulation

King, Jonathan P., Ahluwalia, Harnoor, Zhang, Michael, Pollard, Nancy S.

arXiv.org Artificial Intelligence

This work presents a fast anytime algorithm for computing globally optimal independent contact regions (ICRs). ICRs are regions such that one contact within each region enables a valid grasp. Locations of ICRs can provide guidance for grasp and manipulation planning, learning, and policy transfer. However, ICRs for modern applications have been little explored, in part due to the expense of computing them, as they have a search space exponential in the number of contacts. We present a divide-and-conquer algorithm based on incremental n-dimensional Delaunay triangulation that produces results with bounded suboptimality in times sufficient for real-time planning. This paper presents the base algorithm for grasps where contacts lie within a plane. Our experiments show substantial benefits over competing grasp quality metrics and speedups of 100X or more over competing approaches to computing ICRs. We explore robustness of a policy guided by ICRs and outline a path to general 3D implementation. Code will be released on publication to facilitate further development and applications.


Prosody as a Teaching Signal for Agent Learning: Exploratory Studies and Algorithmic Implications

Knierim, Matilda, Jain, Sahil, Aydoğan, Murat Han, Mitra, Kenneth, Desai, Kush, Saran, Akanksha, Baraka, Kim

arXiv.org Artificial Intelligence

Agent learning from human interaction often relies on explicit signals, but implicit social cues, such as prosody in speech, could provide valuable information for more effective learning. This paper advocates for the integration of prosody as a teaching signal to enhance agent learning from human teachers. Through two exploratory studies--one examining voice feedback in an interactive reinforcement learning setup and the other analyzing restricted audio from human demonstrations in three Atari games--we demonstrate that prosody carries significant information about task dynamics. Our findings suggest that prosodic features, when coupled with explicit feedback, can enhance reinforcement learning outcomes. Moreover, we propose guidelines for prosody-sensitive algorithm design and discuss insights into teaching behavior. Our work underscores the potential of leveraging prosody as an implicit signal for more efficient agent learning, thus advancing human-agent interaction paradigms.


SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts' QA Through Six-Dimensional Feature Analysis

Pardawala, Huzaifa, Sukhani, Siddhant, Shah, Agam, Kejriwal, Veer, Pillai, Abhishek, Bhasin, Rohan, DiBiasio, Andrew, Mandapati, Tarun, Adha, Dhruv, Chava, Sudheer

arXiv.org Artificial Intelligence

Fact-checking is extensively studied in the context of misinformation and disinformation, addressing objective inaccuracies. However, a softer form of misinformation involves responses that are factually correct but lack certain features such as clarity and relevance. This challenge is prevalent in formal Question-Answer (QA) settings such as press conferences in finance, politics, sports, and other domains, where subjective answers can obscure transparency. Despite this, there is a lack of manually annotated datasets for subjective features across multiple dimensions. To address this gap, we introduce SubjECTive-QA, a human-annotated dataset on Earnings Call Transcripts' (ECTs) QA sessions, as the answers given by company representatives are often open to subjective interpretations and scrutiny. The dataset includes 49,446 annotations for long-form QA pairs across six features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant. These features are carefully selected to encompass the key attributes that reflect the tone of the answers provided during QA sessions across different domains. Our findings are that the best-performing Pre-trained Language Model (PLM), RoBERTa-base, has similar weighted F1 scores to Llama-3-70b-Chat on features with lower subjectivity, such as Relevant and Clear, with a mean difference of 2.17% in their weighted F1 scores. The models perform significantly better on features with higher subjectivity, such as Specific and Assertive, with a mean difference of 10.01% in their weighted F1 scores. Furthermore, testing SubjECTive-QA's generalizability using QAs from White House Press Briefings and Gaggles yields an average weighted F1 score of 65.97% using our best models for each feature, demonstrating broader applicability beyond the financial domain. SubjECTive-QA is publicly available under the CC BY 4.0 license.
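For readers unfamiliar with the metric, the weighted F1 score reported above averages per-class F1 scores, weighting each class by its share of the true labels. The following is a minimal Python sketch of that standard definition; the function name and the integer labels in the usage note are hypothetical and are not taken from the dataset or the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 over classes present in y_true."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive
        # because every class considered here has support n >= 1.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score
```

As a small worked case with made-up labels, `weighted_f1([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])` gives per-class F1 values of 1, 2/3, and 4/5 with equal supports, so the weighted score is 37/45, roughly 0.822.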


Improving Academic Skills Assessment with NLP and Ensemble Learning

Huang, Xinyi, Wu, Yingyi, Zhang, Danyang, Hu, Jiacheng, Long, Yujian

arXiv.org Artificial Intelligence

This study addresses the critical challenges of assessing foundational academic skills by leveraging advancements in natural language processing (NLP). Traditional assessment methods often struggle to provide timely and comprehensive feedback on key cognitive and linguistic aspects, such as coherence, syntax, and analytical reasoning. Our approach integrates multiple state-of-the-art NLP models, including BERT, RoBERTa, BART, DeBERTa, and T5, within an ensemble learning framework. These models are combined through stacking techniques using LightGBM and Ridge regression to enhance predictive accuracy. The methodology involves detailed data preprocessing, feature extraction, and pseudo-label learning to optimize model performance. By incorporating sophisticated NLP techniques and ensemble learning, this study significantly improves the accuracy and efficiency of assessments, offering a robust solution that surpasses traditional methods and opens new avenues for educational technology research focused on enhancing core academic competencies.